PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is a worldwide study developed by the Organisation for Economic Co-operation and Development (OECD) that examines the skills of 15-year-old school students around the world. The study assesses students' mathematics, science, and reading skills and contains a wealth of information on students' background, their school and the organisation of education systems.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False
%matplotlib inline
pisadict2012 = pd.read_csv("pisadict2012.csv", encoding='latin1', index_col=0)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 100000)
pd.set_option('display.max_colwidth', 1000)
pisadict2012
pd.reset_option('all', True)
%%time
pisa2012 = pd.read_csv('pisa2012.csv', encoding='latin1', index_col=0,
                       # pandas >= 1.3: replaces error_bad_lines=False, warn_bad_lines=True
                       on_bad_lines='warn',
                       low_memory=False, skiprows=[241337])
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_info_columns', 1000)
pisa2012.sample(10)
pisa2012.info()
boolean_columns = []
for column in pisa2012.select_dtypes('object').columns:
    # a column is treated as boolean when it has exactly two values, one of them "Yes"
    if pisa2012[column].nunique() == 2 and "Yes" in pisa2012[column].unique():
        boolean_columns.append(column)
pisa2012[boolean_columns].sample(10)
pd.reset_option('all', True)
# check duplicates
pisa2012.duplicated().any()
pisa2012['NC'].unique()
pisa2012['CNT'].unique()
# check whether the two columns are equal
pisa2012['CNT'].equals(pisa2012['NC'])
# Get total number of missing values
pisa2012.isnull().sum().sum()
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_info_columns', 1000)
# Get the statistic of numerical columns
pisa2012.select_dtypes('number').describe().T
# Get the statistics of categorical columns
pisa2012.select_dtypes('object').describe().T
pd.reset_option('all', True)
pisa2012.shape
pisa2012['ST01Q01'].describe()
# correlation between mean international grade and schoolid ???
pisa2012.groupby(by=['SCHOOLID'], as_index=False)['ST01Q01'].mean().corr(method='pearson')
# Does performance in MATH depend on GENDER ???
(pisa2012
.groupby(by=['ST04Q01'], as_index=False)
['PV1MATH'].mean()
.rename(columns={'ST04Q01': 'Gender',
'PV1MATH': 'Mean Plausible value 1 in mathematics'})
)
# Does performance in MATH depend on OECD ???
(pisa2012
.groupby(by=['OECD'], as_index=False)
['PV1MATH'].mean()
.rename(columns={'OECD': 'OECD Country',
'PV1MATH': 'Mean Plausible value 1 in mathematics'})
)
# Does the Sense of Belonging - Belong at School affect a student's grades ???
(pisa2012
.groupby(by=['ST87Q03'], as_index=False)
['ST01Q01'].mean()
.rename(columns={'ST87Q03': "Sense of Belonging - Belong at School affect a student's grades",
'ST01Q01': 'Mean International grade'})
)
# Does Perceived Control - Problems Prevent from Putting Effort into School affect student grades ???
(pisa2012
.groupby(by=['ST91Q03'], as_index=False)
['ST01Q01'].mean()
.rename(columns={'ST91Q03': "Perceived Control - Problems Prevent from Putting Effort into School",
'ST01Q01': "Mean International grade"})
)
# Attitudes :: Attributions to Failure - Teacher Did Not Explain Well affect student grades ???
(pisa2012
.groupby(by=['ST44Q03'], as_index=False)
['ST01Q01'].mean()
.rename(columns={'ST44Q03': "Attributions to Failure - Teacher Did Not Explain Well",
'ST01Q01': "Mean International grade"})
)
# Attitudes :: Math Teaching - Teacher shows interest affect student grades ???
(pisa2012
.groupby(by=['ST77Q01'], as_index=False)
['ST01Q01'].mean()
.rename(columns={'ST77Q01': "Math Teaching - Teacher shows interest affect student grades",
'ST01Q01': "Mean International grade"})
)
# Practices :: Teacher-Directed Instruction - Encourages Thinking and Reasoning affect student grades ???
(pisa2012
.groupby('ST79Q02', as_index=False)
['ST01Q01'].mean()
.rename(columns={'ST79Q02': 'Teacher-Directed Instruction - Encourages Thinking and Reasoning',
'ST01Q01': 'Mean International grade'})
)
# Practices :: Teacher Support - Helps Students with Learning affect student grades ???
(pisa2012
.groupby(by=['ST83Q03'], as_index=False)
['ST01Q01'].mean()
.rename(columns={'ST83Q03': 'Teacher Support - Helps Students with Learning',
'ST01Q01': 'Mean International Grade'})
)
# Inequality :: Immigration status
(pisa2012
.groupby(by=['IMMIG'], as_index=False)
['ST01Q01'].mean()
.rename(columns={'IMMIG': 'Immigration status',
'ST01Q01': 'Mean International Grade'})
)
# Inequality :: Immigration status AND How many books at home
immig_book = \
pd.crosstab(index=pisa2012["ST28Q01"],
columns=pisa2012["IMMIG"],
colnames=['Immigration status'],
normalize=True,
margins=True, margins_name="All").style.format('{:.2%}')
immig_book.index.name='How many books at home'
immig_book
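The crosstab above uses `normalize=True`, which divides every cell by the grand total, so the table answers "what share of all students falls in this cell". A minimal sketch on made-up data (the `toy` frame and its labels are illustrative assumptions, not PISA values) contrasts this with `normalize="columns"`, which instead makes each immigration-status column sum to 1 and is often easier to read when the groups differ a lot in size.

```python
import pandas as pd

# Toy data for illustration only, not taken from the PISA file
toy = pd.DataFrame({
    "books": ["0-10", "0-10", "11-25", "11-25", "11-25", "0-10"],
    "immig": ["Native", "Native", "Native", "First", "First", "First"],
})

# Share of the grand total: all cells together sum to 1
overall = pd.crosstab(toy["books"], toy["immig"], normalize=True)

# Share within each immigration status: every column sums to 1
by_status = pd.crosstab(toy["books"], toy["immig"], normalize="columns")

print(overall)
print(by_status)
```

With `normalize="columns"`, the book categories can be compared across immigration statuses without the larger group dominating the percentages.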
%%time
# Compute the average of plausible scores to determine the PISA score of a student in a particular subject
pisa2012["Math"] = (pisa2012['PV1MATH']
+ pisa2012['PV2MATH']
+ pisa2012['PV3MATH']
+ pisa2012['PV4MATH']
+ pisa2012['PV5MATH'])/5
pisa2012["Reading"] = (pisa2012['PV1READ']
+ pisa2012['PV2READ']
+ pisa2012['PV3READ']
+ pisa2012['PV4READ']
+ pisa2012['PV5READ'])/5
pisa2012["Science"] = (pisa2012['PV1SCIE']
+ pisa2012['PV2SCIE']
+ pisa2012['PV3SCIE']
+ pisa2012['PV4SCIE']
+ pisa2012['PV5SCIE'])/5
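The three blocks above repeat the same arithmetic for each subject. As a sketch only, the same per-student average can be written once with a row-wise mean; the helper name `mean_plausible_value` and the toy values are assumptions made for illustration. `skipna=False` is passed so that a missing plausible value yields NaN, matching the behavior of the explicit sum.

```python
import pandas as pd

# Toy frame with hypothetical plausible values for two students
toy = pd.DataFrame({
    "PV1MATH": [500.0, 400.0], "PV2MATH": [510.0, 420.0],
    "PV3MATH": [490.0, 410.0], "PV4MATH": [505.0, 390.0],
    "PV5MATH": [495.0, 380.0],
})

def mean_plausible_value(df, subject, n_values=5):
    """Row-wise mean of the columns PV1<subject> ... PV<n_values><subject>."""
    cols = [f"PV{i}{subject}" for i in range(1, n_values + 1)]
    # skipna=False propagates NaN, like adding the columns by hand
    return df[cols].mean(axis=1, skipna=False)

toy["Math"] = mean_plausible_value(toy, "MATH")
print(toy["Math"])
```

On the real frame the call would look like `pisa2012["Math"] = mean_plausible_value(pisa2012, "MATH")`, and likewise for `"READ"` and `"SCIE"`.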
After assessing the data and reading through the data description at this source, we find 241336 observations in the dataset with 635 features. 30 countries are represented in our dataset. The dataset contains a list of indices, each representing a particular feature to study, and each index is built from a list of items.
Some interesting questions to investigate are:
How does the choice of school play into academic performance?
Are there differences in achievement based on gender, location, or student attitudes?
Are there differences in achievement based on teacher practices and attitudes?
Does there exist inequality in academic achievement?
School climate indices and attitudes towards school could help investigate how the choice of school impacts academic performance. The education level of parents, family structure, relative grade, household possessions, attitudes towards mathematics and opportunity to learn could help explain differences in achievement, whether based on gender, location, student attitudes or teacher practices and attitudes. Finally, the highest occupational status of parents and immigration background could highlight the existence of inequality in academic achievement.
We first create some utility functions.
def transform_categorical_column_to_be_ordered(df: pd.DataFrame, cat: list, column_to_transf: list) -> None:
    """
    Convert the given columns of df, in place, to an ordered categorical dtype
    whose categories follow the order of cat. Values not listed in cat become NaN.
    """
    ordered_cat = pd.api.types.CategoricalDtype(categories=cat, ordered=True)
    for column in column_to_transf:
        df[column] = df[column].astype(ordered_cat)
def annotate_barplot(value_counts: pd.Series, position=True, ylabel='Percentage', xlabel='Percentage'):
    """
    Annotate the current barplot given a pandas Series holding the value counts of the plotted column.
    position=True annotates vertical bars; position=False annotates horizontal bars.
    """
    if position:
        for i, v in enumerate(value_counts):
            plt.text(i, v, str(round(v, 4)), fontdict=dict(color='black', fontsize=12), va='bottom')
        plt.ylabel(ylabel)
    else:
        for i, v in enumerate(value_counts):
            plt.text(v, i, str(round(v, 4)), fontdict=dict(color='black', fontsize=12), ha='left')
        plt.xlabel(xlabel)
def summary_barplot_items(df: pd.DataFrame, col: list, suptitle: str, title: list,
                          nr=1, nc=1,
                          figh=5, figw=12,
                          fig_a_top=.8, fig_a_wspace=.2, fig_a_hspace=.9,
                          color=["tab:blue"], vert=True):
    """
    Plot the normalized value counts of each column in col on its own axes
    of an nr x nc grid and return the flattened array of axes.
    Note: figh is used as the figure width and figw as its height, and
    vert=True draws horizontal bars ('barh') while vert=False draws vertical bars.
    """
    # squeeze=False always returns a 2D array, so flatten() also works for a 1x1 grid
    fig, axes = plt.subplots(nrows=nr, ncols=nc, figsize=(figh, figw), squeeze=False)
    fig.suptitle(t=suptitle, x=0.5, y=0.95, fontsize=20, fontweight='bold', color='tab:blue')
    axes_ = axes.flatten()
    kind = 'barh' if vert else 'bar'
    for idx, c in enumerate(col):
        _counts = df[c].value_counts(normalize=True).sort_index()
        _counts.plot(kind=kind, color=color, rot=0, title=title[idx], ax=axes_[idx])
        # the percentage label goes on the axis that carries the bar length
        if vert:
            axes_[idx].set_xlabel("Percentage")
        else:
            axes_[idx].set_ylabel("Percentage")
    fig.subplots_adjust(top=fig_a_top, wspace=fig_a_wspace, hspace=fig_a_hspace)
    return axes_
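The plots below rely on ordered categoricals so that `value_counts().sort_index()` lists the answers in their logical order rather than alphabetically. A small sketch on made-up answers (the `answers` Series is an illustrative assumption) shows the difference.

```python
import pandas as pd

# Made-up survey answers for illustration
answers = pd.Series(["Agree", "Strongly disagree", "Agree", "Disagree"])

# Plain strings: sort_index() orders the labels alphabetically
alphabetical = answers.value_counts().sort_index().index.tolist()

# Ordered categorical: sort_index() follows the declared category order
order = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"]
ordered = answers.astype(pd.CategoricalDtype(categories=order, ordered=True))
declared = ordered.value_counts().sort_index().index.tolist()

print(alphabetical)
print(declared)
```

Categories that never occur (here "Strongly agree") still appear with a count of zero, which keeps the bars aligned across subplots that share the same scale.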
We start by exploring the distribution of countries that took part in the study.
cnt_counts = pisa2012['CNT'].value_counts(normalize=True, ascending=True)
cnt_counts.plot(kind='barh', figsize=(12,10), color='tab:blue',
title='Percentage of observations per country');
annotate_barplot(cnt_counts, position=False)
Italy is the country with the lowest percentage of observations in the study and Spain has the highest.
We explore here the distribution of countries belonging to OECD in the study.
oecd_counts = pisa2012['OECD'].value_counts(ascending=True)
oecd_counts.plot(kind='pie',
startangle=90,
autopct='%.2f%%',
figsize=(8,5),
counterclock=False,
title='Distributions of OECD observations.',
labels=["", ""]);
plt.legend(oecd_counts.index.tolist(), loc="best");
plt.ylabel("");
plt.axis('square');
Approximately 68.67% of observations come from OECD member countries and 31.33% from non-OECD countries.
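The pie chart counts rows, so its percentages are shares of observations. A share of countries needs one row per country first. The sketch below uses a made-up frame (the country codes "A", "B", "C" are placeholders) to show how the two figures can differ.

```python
import pandas as pd

# Made-up data: country A contributes three students, B and C one each
toy = pd.DataFrame({
    "CNT": ["A", "A", "A", "B", "C"],
    "OECD": ["OECD", "OECD", "OECD", "Non-OECD", "Non-OECD"],
})

# Share of observations (rows) per OECD status
row_share = toy["OECD"].value_counts(normalize=True)

# Share of distinct countries per OECD status
country_share = (toy.drop_duplicates("CNT")["OECD"]
                    .value_counts(normalize=True))

print(row_share)
print(country_share)
```

Here 60% of the rows are OECD although only one country in three is, because the OECD country contributes more students.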
We explore here how the population of the study is distributed per gender.
st04q01_counts = pisa2012['ST04Q01'].value_counts(ascending=True)
st04q01_counts.plot(kind='pie',
startangle=90,
autopct='%.2f%%',
figsize=(8,5),
counterclock=False,
title='Distributions of Gender observations.',
labels=["", ""]);
plt.legend(st04q01_counts.index.tolist(), loc="best");
plt.ylabel("");
plt.axis('square');
The data is almost equally distributed in terms of gender, with 50.66% of observations being Male and 49.34% Female.
We explore here the distribution of students' performance in Math.
fig, axs = plt.subplots(nrows=1, ncols=2, sharex=False, sharey=False, figsize=(12,5))
fig.suptitle(t='Distribution & Boxplot of Maths Score (red line indicates mean)',
fontweight='bold', fontsize=20, color="tab:blue")
binsize = 10
pv1math= pisa2012['Math']
bins = np.arange(0, pv1math.max()+binsize, binsize)
axs[0].axvline(pv1math.mean(), color='tab:red');
pv1math.plot(kind='hist',
bins=bins,
ax= axs[0]
);
axs[0].set_xlabel('Math Score');
pv1math.plot(kind='box',
vert=False,
ax= axs[1]
);
axs[1].set_yticks([], [])
axs[1].set_xlabel("Math Score");
The Maths score is approximately normally distributed with a mean of 472 and a standard deviation of 99.
The Maths score presents a lot of outliers.
We explore the distribution of students' reading scores.
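The points drawn beyond the whiskers of a matplotlib boxplot are, with the default `whis=1.5`, the values lying more than 1.5 times the interquartile range outside the quartiles. As a sketch on made-up scores (the `scores` Series is an illustrative assumption), the same rule can be applied by hand to count such points.

```python
import pandas as pd

# Made-up scores for illustration
scores = pd.Series([300, 450, 460, 470, 480, 490, 500, 510, 700], dtype=float)

q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences used by the default boxplot whiskers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]

print(lower, upper)
print(outliers.tolist())
```

On a large, roughly bell-shaped sample, a small fraction of points beyond these fences is expected, so many plotted outliers do not by themselves indicate data problems.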
fig, axs = plt.subplots(nrows=1, ncols=2, sharex=False, sharey=False, figsize=(12,5))
fig.suptitle(t='Distribution & Boxplot of Reading Score (red line indicates mean)',
fontweight='bold', fontsize=20, color="tab:blue")
binsize = 10
pv1read= pisa2012['Reading']
bins = np.arange(0, pv1read.max()+binsize, binsize)
pv1read.plot(kind='hist',
bins=bins,
ax=axs[0]
);
axs[0].set_xlabel('Reading Score');
axs[0].axvline(pv1read.mean(), color='tab:red');
pv1read.plot(kind='box',
vert=False,
ax=axs[1]
);
axs[1].set_yticks([], [])
axs[1].set_xlabel("Reading Score");
The Reading score is approximately normally distributed with a mean of 477 and a standard deviation of 99.
The Reading score presents a lot of outliers.
We explore the distribution of students' performance in Science.
fig, axs = plt.subplots(nrows=1, ncols=2, sharex=False, sharey=False, figsize=(12,5))
fig.suptitle(t='Distribution & Boxplot of Science Score (red line indicates mean)',
fontweight='bold', fontsize=20, color="tab:blue")
binsize = 10
pv1scie= pisa2012['Science']
bins = np.arange(0, pv1scie.max()+binsize, binsize)
pv1scie.plot(kind='hist',
bins=bins,
ax= axs[0]
);
axs[0].set_xlabel('Science Score');
axs[0].axvline(pv1scie.mean(), color='tab:red');
pv1scie.plot(kind='box',
vert=False,
ax = axs[1]
);
axs[1].set_xlabel("Science Score");
axs[1].set_yticks([],[]);
The Science score is approximately normally distributed with a mean of 482 and a standard deviation of 98.
The Science score presents a lot of outliers.
Below, we explore the distribution of the student grade compared to the modal grade in the country.
From the definition of the column GRADE, the values lie in the range [-grade, +grade], with 0 denoting the modal grade of the country.
pisa2012['GRADE'].value_counts(normalize=True)
pisa2012['GRADE'].astype(str).unique()
pisa2012['GRADE'] = pisa2012['GRADE'].astype(str)
grade = ['-3.0', '-2.0', '-1.0', '0.0', '1.0', '2.0']
transform_categorical_column_to_be_ordered(pisa2012, grade, ['GRADE'])
grade_counts = pisa2012['GRADE'].value_counts(normalize=True).sort_index()
grade_counts.plot(kind='bar', figsize=(8,5), title='Distribution of GRADE observations', color="tab:blue", rot=0);
plt.xlabel("GRADE");
annotate_barplot(grade_counts)
From the graph we can see that a lot of students are at the modal grade of the country (value of 0) and very few are far from it (values well below or above 0).
We explore here the distribution of immigration status and of the international country of birth for the students, their fathers and their mothers.
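One detail of the GRADE conversion is worth spelling out: `astype(str)` turns missing values into the literal string `'nan'`, and casting to a categorical whose categories do not include that label turns them back into missing values. The sketch below illustrates this on a made-up Series (the values and the shortened `levels` list are assumptions).

```python
import numpy as np
import pandas as pd

# Made-up relative grades with one missing value
grade = pd.Series([0.0, -1.0, np.nan, 1.0])

as_text = grade.astype(str)  # NaN becomes the string 'nan'
levels = ["-1.0", "0.0", "1.0"]
as_cat = as_text.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Labels outside `levels` (here 'nan') are missing again after the cast
n_missing = int(as_cat.isna().sum())

print(as_text.tolist())
print(n_missing)
```

Because `value_counts(normalize=True)` drops missing values by default, the percentages in the bar chart are shares of the students with a recorded grade.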
immig_counts = pisa2012['IMMIG'].value_counts(normalize=True, ascending=False)
immig_counts.plot(kind='bar', color='tab:blue', rot=0, title='Percentage of immigrants status');
annotate_barplot(immig_counts)
From the visualization we can tell that the majority of students who took the assessment were Native and approximately 6.7% were second- or first-generation.
# Visualizing top 10 countries of country of birth national of students and their parents
fig = plt.figure(figsize=(19,8))
fig.add_subplot(131)
cobnf_counts = pisa2012['COBN_F'].value_counts(normalize=True, ascending=False)[:10]
cobnf_counts.plot(kind='barh', color='tab:blue', rot=0, title='Country of Birth National Categories- Father');
annotate_barplot(cobnf_counts, position=False)
fig.add_subplot(132)
cobnm_counts = pisa2012['COBN_M'].value_counts(normalize=True, ascending=False)[:10]
cobnm_counts.plot(kind='barh', color='tab:blue', rot=0, title='Country of Birth National Categories- Mother');
annotate_barplot(cobnm_counts, position=False)
fig.add_subplot(133)
cobns_counts = pisa2012['COBN_S'].value_counts(normalize=True, ascending=False)[:10]
cobns_counts.plot(kind='barh', color='tab:blue', rot=0, title='Country of Birth National Categories- Self');
annotate_barplot(cobns_counts, position=False)
plt.subplots_adjust(top=.88, hspace=.2, wspace=.9);
fig.suptitle(t="Distribution Country of Birth National of students and parents",
fontsize=20, fontweight="bold", color="tab:blue");
## Full visualization
# cobnf_counts = pisa2012['COBN_F'].value_counts(normalize=True, ascending=False)
# cobnf_counts.plot(kind='barh', color='tab:blue', rot=0, title='Country of Birth National Categories- Father',
# figsize=(39,31));
# annotate_barplot(cobnf_counts, position=False)
## Full visualization
# cobnm_counts = pisa2012['COBN_M'].value_counts(normalize=True, ascending=False)
# cobnm_counts.plot(kind='barh', color='tab:blue', rot=0, title='Country of Birth National Categories- Mother',
# figsize=(39,31));
# annotate_barplot(cobnm_counts, position=False)
## Full visualization
# cobns_counts = pisa2012['COBN_S'].value_counts(normalize=True, ascending=False)
# cobns_counts.plot(kind='barh', color='tab:blue', rot=0, title='Country of Birth National Categories- Self',
# figsize=(39,31));
# annotate_barplot(cobns_counts, position=False)
From the visualization above, the two most common countries of birth for the students and their parents are Canada and Brazil. Among the countries shown, the least common is Bangladesh.
col = ['ST20Q01', 'ST20Q02', 'ST20Q03']
titles = ["Students", "Mother", "Father"]
axes = summary_barplot_items(pisa2012, col, "Country of birth international Percentage", titles,
nr=1, nc=3,figh=19, figw=5)
for i in [1,2]:
axes[i].set_yticks([],[]);
Exploring the background of the students further, we observe from the visualization above that the majority of students and parents were born in the country of the test and less than approximately 13% come from another country.
We explore here the distribution of family structure in the dataset.
From the definition of the column FAMSTRUC it is a categorical column which indicates the number of parents living with the student.
pisa2012['FAMSTRUC'].value_counts()
# convert the column to string
pisa2012['FAMSTRUC'] = pisa2012['FAMSTRUC'].astype(str)
famstruc_cat = ['1.0', '2.0', '3.0']
transform_categorical_column_to_be_ordered(pisa2012, famstruc_cat, ['FAMSTRUC'])
famstruc_counts = pisa2012['FAMSTRUC'].value_counts(normalize=True).sort_index()
famstruc_counts.plot(kind='bar', color="tab:blue", rot=0, title="Distribution of family structure observations");
annotate_barplot(famstruc_counts)
The above visualization tells us that the majority of students have a family structure value of 2 and the minority a value of 3.
col = ['ST11Q01', 'ST11Q02', 'ST11Q03', 'ST11Q04', 'ST11Q05', 'ST11Q06']
titles = ["Mother", "Father", "Brothers", "Sisters", "Grandparents", "Others"]
axes = summary_barplot_items(pisa2012, col, "Distribution of relatives at home", titles,
vert=False, color=["tab:orange", "tab:blue"],
nr=2, nc=3,figh=12, figw=10, fig_a_hspace=.5)
for i in [1,2,4,5]:
axes[i].set_ylabel("");
We clearly see here that the mother and the father are the relatives most often at home, while grandparents and others are less often present in the students' lives.
We explore here the distribution of the highest occupational status of parents.
Before exploring we adjust the categories of schooling and job so it contains ordered categories.
Note that we assume NaN values here are observations that were not specified during the study.
axes = pisa2012[['BFMJ2', 'BMMJ1', 'HISEI']].plot(kind='hist', sharex=True, sharey=False,
bins=50,
subplots=True,
color="tab:blue",
figsize=(12,8),
legend=True)
axes[0].legend(["Father"])
axes[1].legend(["Mother"])
axes[2].legend(["Parents"]);
plt.suptitle(t="Summary of Occupational status", fontsize = 20, fontweight='bold', color = 'tab:blue');
We can see that the occupational status of the father, the mother and the parents varies a lot. The highest occupational scores (approximately 90) account for fewer than 2500 observations, with the mother category having the fewest. On the other hand, a score of about 30 has the most observations across all categories.
According to the definitions of the parents' education level columns FISCED, MISCED and HISCED, the data can be categorised as follows:
pisa2012['FISCED'].unique()
educ_level_cat = ['None','ISCED 1', 'ISCED 2', 'ISCED 3B, C', 'ISCED 3A, ISCED 4', 'ISCED 5B', 'ISCED 5A, 6']
transform_categorical_column_to_be_ordered(pisa2012, educ_level_cat, ['FISCED', 'MISCED', 'HISCED'])
col = ['FISCED', 'MISCED', 'HISCED']
titles = ["Father", "Mother", "Parents"]
axes = summary_barplot_items(pisa2012, col, "Educational level (ISCED)", titles,
nr=1, nc=3,figh=12, figw=5, color=["tab:red", "tab:blue", "tab:blue",
"tab:blue", "tab:blue", "tab:blue", "tab:blue"])
for i in [1,2]:
axes[i].set_yticks([],[]);
From the visualization, less than 5% of parents have no education level and the majority have a theoretically oriented tertiary or post-graduate education level.
schooling_order_mother = ['She did not complete <ISCED level 1> ',
'<ISCED level 1> ','<ISCED level 2> ','<ISCED level 3A> ','<ISCED level 3B, 3C> ']
transform_categorical_column_to_be_ordered(pisa2012, schooling_order_mother, ['ST13Q01'])
schooling_order_father = ['He did not complete <ISCED level 1> ',
'<ISCED level 1> ','<ISCED level 2> ','<ISCED level 3A> ','<ISCED level 3B, 3C> ']
transform_categorical_column_to_be_ordered(pisa2012, schooling_order_father, ['ST17Q01'])
pisa2012['ST15Q01'].unique().tolist()
col = ['ST13Q01', 'ST17Q01', 'ST15Q01', 'ST19Q01']
titles = ["Mother Highest Schooling", "Father Highest Schooling", "Mother Current Job Status", "Father Current Job Status"]
axes = summary_barplot_items(pisa2012, col, "Distribution of parents education and job", titles,
nr=2, nc=2,figh=12, figw=5, color=["tab:blue", "tab:blue", "tab:blue", "tab:green", "tab:blue"])
for i in [1,3]:
axes[i].set_yticks([],[]);
From the above visualization, most of the students (more than 50%) have parents whose highest schooling is general upper secondary and who work full-time.
Using the definition of variables provided in the source, we explore family wealth in terms of four indices: family wealth possessions (WEALTH), cultural possessions (CULTPOS), home educational resources (HEDRES) and home possessions (HOMEPOS),
where home possessions summarizes the individual household items explored further below.
axs = pisa2012[['WEALTH', 'CULTPOS', 'HEDRES', 'HOMEPOS']].plot(kind='hist', sharex=True, sharey=False,
bins=8,
subplots=True,
figsize=(5,12),
color="tab:blue",
legend=False);
axs[0].legend(["Family wealth possessions"])
axs[1].legend(["Cultural possessions"])
axs[2].legend(["Home educational resources"])
axs[3].legend(["Home possessions"]);
plt.suptitle(t="Summary of the distribution of family wealth indices",
fontsize = 20, fontweight='bold', color = 'tab:blue');
We can observe that the family wealth and home possessions indices seem to be approximately normally distributed, while the home educational resources and cultural possessions indices are left-skewed.
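Skewness read off a histogram with only a few bins can be checked numerically. As a sketch on made-up samples (both Series are illustrative assumptions), `Series.skew()` returns a negative value when the long tail is on the left and a value near zero for a symmetric sample.

```python
import pandas as pd

# Made-up samples for illustration
left_skewed = pd.Series([1, 7, 8, 8, 9, 9, 9, 10, 10, 10], dtype=float)
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

skew_left = left_skewed.skew()  # negative: long tail on the left
skew_sym = symmetric.skew()     # about zero for a symmetric sample

print(skew_left, skew_sym)
```

Applied to the real data, a call such as `pisa2012[['WEALTH', 'CULTPOS', 'HEDRES', 'HOMEPOS']].skew()` would give one value per index to compare against the visual impression.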
col = ['ST26Q01', 'ST26Q02', 'ST26Q03', 'ST26Q04', 'ST26Q05', 'ST26Q06', 'ST26Q07',
'ST26Q08', 'ST26Q09', 'ST26Q10', 'ST26Q11', 'ST26Q12', 'ST26Q13', 'ST26Q14']
titles = ["Desk", "Own room", "Study place", "Computer", "Software", "Internet", "Literature",
"Poetry", 'Art', "Textbook", "Technical reference books", "Dictionary", "dishwasher", "DVD"]
axes = summary_barplot_items(pisa2012, col, "Summary of possession distribution", titles,
nr=3, nc=5,figh=12, figw=10, color=["tab:orange", "tab:blue"],
vert=False, fig_a_hspace=.6, fig_a_wspace=.9, fig_a_top=.8)
for i in [1,2,3,4,6,7,8,9,11,12,13]:
axes[i].set_ylabel("");
plt.delaxes(axes[-1]);
From the summary of possession distribution, households are roughly evenly split on possessing literature, poetry, art, technical reference books and software at home. On the other hand, desks, own rooms, study places, computers, internet, dictionaries, dishwashers and DVD players are predominant at home.
pisa2012["ST27Q01"].unique()
background = ['None', 'One', 'Two', 'Three or more']
transform_categorical_column_to_be_ordered(pisa2012, background, ['ST27Q01', 'ST27Q02', 'ST27Q03', 'ST27Q04', 'ST27Q05'])
pisa2012['ST28Q01'].unique()
num_book = ['0-10 books ', '11-25 books ', '26-100 books ',
'101-200 books ', '201-500 books ', 'More than 500 books']
transform_categorical_column_to_be_ordered(pisa2012, num_book, ['ST28Q01'])
col = ['ST27Q01', 'ST27Q02', 'ST27Q03', 'ST27Q04', 'ST27Q05', 'ST28Q01']
titles = ["Cellular Phones", "Television", "Computers", "Cars", "Rooms bath or shower", "Books at home"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Home background", titles,
nr=2, nc=5,figh=19, figw=10, color=["tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
plt.delaxes(axes[-1]);
plt.delaxes(axes[-2]);
plt.delaxes(axes[-3]);
plt.delaxes(axes[-4]);
In terms of home background, there are few homes without cellular phones, televisions, computers or rooms with a bath or shower: the percentage of homes lacking each of them is below 10%. On the other hand, approximately 30% of families possess 26-100 books.
Per the definitions given in the source, the attitude towards mathematics can be interpreted from the following indices:
axs = (pisa2012[['INSTMOT','INTMAT', 'SUBNORM', 'MATHEFF', 'ANXMAT', 'SCMAT', 'FAILMAT', 'MATWKETH', 'MATINTFC', 'MATBEH']]
       .plot(kind='hist',
             bins=10,
             subplots=True,
             figsize=(5,20),
             color='tab:blue',
             sharex=True, sharey=True,
             ))
axs[0].legend(["Instrumental Motivation for Mathematics"])
axs[1].legend(["Mathematics Interest"])
axs[2].legend(["Subjective Norms in Mathematics"])
axs[3].legend(["Mathematics Self-Efficacy"])
axs[4].legend(["Mathematics Anxiety"])
axs[5].legend(["Mathematics Self-Concept"])
axs[6].legend(["Attributions to Failure in Mathematics"])
axs[7].legend(["Mathematics Work Ethic"])
axs[8].legend(["Mathematics Intentions"])
axs[9].legend(["Mathematics Behaviour"])
plt.suptitle(t="Summary of Attitudes toward mathematics indices", fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=.92, hspace=.2);
The attitudes toward mathematics seem to be approximately normally distributed across all indices.
In the following cells we will dig into the list of items that constitute each index.
pisa2012['ST29Q01'].unique()
agreement = ['Strongly disagree','Disagree','Agree','Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, agreement, ['ST29Q01', 'ST29Q03', 'ST29Q04', 'ST29Q06'])
col = ['ST29Q01', 'ST29Q03', 'ST29Q04', 'ST29Q06']
titles = ["Enjoy Reading", "Look Forward to Lessons", "Enjoy Maths", "Interested"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Mathematics Interest", titles,
nr=1, nc=4,figh=15, figw=5, color=["tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3]:
axes[i].set_yticks([],[]);
Visualizing students' interest in mathematics, less than 10% strongly agree that they enjoy reading and maths, are interested, or look forward to lessons. By contrast, approximately 30% disagree. The visualization does not give a clear picture, but it suggests that students' interest in mathematics is more negative than positive.
pisa2012['ST29Q02'].unique()
transform_categorical_column_to_be_ordered(pisa2012, agreement, ['ST29Q02', 'ST29Q05', 'ST29Q07', 'ST29Q08'])
col = ['ST29Q02', 'ST29Q05', 'ST29Q07', 'ST29Q08']
titles = ["Worthwhile for Work", "Worthwhile for Career Chances", "Important for Future Study", "Helps to Get a Job"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for instrumental motivation", titles,
nr=1, nc=4,figh=15, figw=5, color=["tab:red", "tab:blue", "tab:blue", "tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3]:
axes[i].set_yticks([],[]);
We clearly see here that less than 10% of students strongly disagree with the instrumental motivation items: worthwhile for work, worthwhile for career chances, important for future study, and helps to get a job. On the other hand, approximately 40% of students agree with these items.
col = ['ST48Q01', 'ST48Q02', 'ST48Q03', 'ST48Q04', 'ST48Q05']
titles = ["Mathematics vs. Language Courses After School", "Mathematics vs. Science Related Major in College",
"Study Harder in Mathematics vs. Language Classes", "Take Maximum Number of Mathematics vs. Science Classes",
"Pursuing a Career That Involves Mathematics vs. Science"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Mathematics intentions", titles,
nr=5, nc=1,figh=5, figw=15, color=["tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2, fig_a_top=.9)
From the visualization above, it is not very clear what the students' mathematics intentions are.
pisa2012['ST37Q01'].unique()
pisa2012['ST37Q05'].unique()
self_efficacy_order = ['Not at all confident', 'Not very confident','Confident', 'Very confident']
transform_categorical_column_to_be_ordered(pisa2012,
self_efficacy_order,
['ST37Q01', 'ST37Q02', 'ST37Q03', 'ST37Q04',
'ST37Q05', 'ST37Q06', 'ST37Q07','ST37Q08'])
col = ['ST37Q01', 'ST37Q02', 'ST37Q03', 'ST37Q04', 'ST37Q05', 'ST37Q06', 'ST37Q07', 'ST37Q08']
titles = ["Using a Train Timetable", "Calculating TV Discount", "Calculating Square Metres of Tiles",
"Understanding Graphs in Newspapers", 'Solving Equation 1', "Distance to Scale",
"Solving Equation 2", "Calculate Petrol Consumption Rate"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Math Self-Efficacy", titles,
nr=2, nc=4,figh=19, figw=10, color=["tab:red","tab:blue", "tab:blue", "tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3,5,6,7]:
axes[i].set_yticks([],[]);
For all parameters investigated, less than 10% of students are not at all confident about their math self-efficacy, even though parameters like Distance to Scale and Calculate Petrol Consumption Rate show a high percentage of Not very confident.
pisa2012['ST42Q01'].unique()
math_anxiety_order = ['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012,
math_anxiety_order,
['ST42Q01', 'ST42Q03', 'ST42Q05', 'ST42Q08', 'ST42Q10'])
col = ['ST42Q01', 'ST42Q03', 'ST42Q05', 'ST42Q08', 'ST42Q10']
titles = ["Worry That It Will Be Difficult", "Get Very Tense", "Get Very Nervous",
"Feel Helpless", 'Worry About Getting Poor Grades',]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Mathematics anxiety", titles,
nr=1, nc=5,figh=15, figw=5, color=["tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
For the items Worry That It Will Be Difficult and Worry About Getting Poor Grades, Agree is the most frequent answer. For Get Very Tense, Get Very Nervous and Feel Helpless, Disagree is the most frequent answer. Overall, the items do not point to a single conclusion about students' mathematics anxiety.
pisa2012['ST42Q09'].unique()
self_concept_order = ['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, self_concept_order,
['ST42Q02', 'ST42Q04', 'ST42Q06', 'ST42Q07', 'ST42Q09'])
col = ['ST42Q02', 'ST42Q04', 'ST42Q06', 'ST42Q07', 'ST42Q09']
titles = ["Not Good at Maths", "Get Good Grades", "Learn Quickly",
"One of Best Subjects", 'Understand Difficult Work']
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Mathematics self-concept", titles,
nr=1, nc=5,figh=15, figw=5, color=["tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
For the mathematics self-concept items, Agree is the most frequent answer for Get Good Grades and Learn Quickly, while Disagree is the most frequent answer for Not Good at Maths, One of Best Subjects and Understand Difficult Work. Overall, the students' position on mathematics self-concept is not clear.
pisa2012['ST44Q08'].unique()
failure_order = ['Not at all likely','Slightly likely','Likely','Very Likely']
transform_categorical_column_to_be_ordered(pisa2012, failure_order,
['ST44Q01', 'ST44Q03', 'ST44Q04', 'ST44Q05', 'ST44Q07', 'ST44Q08'])
col = ['ST44Q01', 'ST44Q03', 'ST44Q04', 'ST44Q05', 'ST44Q07', 'ST44Q08']
titles = ["Not Good at Maths Problems", "Teacher Did Not Explain Well", "Bad Guesses",
"Material Too Hard", 'Teacher Didnt Get Students Interested', 'Unlucky']
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Failure in mathematics", titles,
nr=2, nc=3, figh=19, figw=10, color=["tab:blue", "tab:blue", "tab:green", "tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,4,5]:
axes[i].set_yticks([],[]);
Overall, students rate each of the following as a likely attribution for failure in mathematics: not good at maths problems, teacher did not explain well, bad guesses, material too hard, teacher didn't get students interested, and unlucky.
pisa2012['ST46Q09'].unique()
work_ethic_order = ['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, work_ethic_order,
['ST46Q01', 'ST46Q02', 'ST46Q03', 'ST46Q04', 'ST46Q05',
'ST46Q06', 'ST46Q07', 'ST46Q08', 'ST46Q09'])
col = ['ST46Q01', 'ST46Q02', 'ST46Q03', 'ST46Q04', 'ST46Q05', 'ST46Q06', 'ST46Q07', 'ST46Q08', 'ST46Q09']
titles = ["Homework Completed in Time", "Work Hard on Homework", "Prepared for Exams",
"Study Hard for Quizzes", 'Study Until I Understand Everything', "Pay Attention in Classes",
'Listen in Classes', "Avoid Distractions When Studying", "Keep Work Organized"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Mathematics work ethic", titles,
nr=2, nc=5, figh=19, figw=10, color=["tab:red", "tab:blue", "tab:green", "tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3,4,6,7,8,9]:
axes[i].set_yticks([],[]);
plt.delaxes(axes[9]);
Overall, the majority of students answered Agree for every mathematics work ethic item: homework completed in time, work hard on homework, prepared for exams, study hard for quizzes, study until I understand everything, pay attention in classes, listen in classes, avoid distractions when studying and keep work organized. Only a minority answered Strongly disagree.
pisa2012['ST49Q09'].unique()
behavoir_order = ['Never or rarely', 'Sometimes', 'Often', 'Always or almost always']
transform_categorical_column_to_be_ordered(pisa2012, behavoir_order,
['ST49Q01', 'ST49Q02', 'ST49Q03', 'ST49Q04',
'ST49Q05', 'ST49Q06', 'ST49Q07', 'ST49Q09'])
col = ['ST49Q01', 'ST49Q02', 'ST49Q03', 'ST49Q04', 'ST49Q05', 'ST49Q06', 'ST49Q07', 'ST49Q09']
titles = ["Talk about Maths with Friends", "Help Friends with Maths", "<Extracurricular> Activity",
"Participate in Competitions", 'Study More Than 2 Extra Hours a Day', "Play Chess",
'Computer programming', "Participate in Math Club"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Mathematics behavior", titles,
nr=2, nc=4, figh=19, figw=10, color=["tab:blue","tab:blue", "tab:blue", "tab:green"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3,5,6,7]:
axes[i].set_yticks([],[]);
Only a minority of students answered Always or almost always for the mathematics behavior items: talk about maths with friends, help friends with maths, extracurricular activity, participate in competitions, study more than 2 extra hours a day, play chess and computer programming.
According to the definitions of the related variables, opportunity to learn can be decomposed as follows:
axs = pisa2012[['EXAPPLM', 'EXPUREM', 'FAMCON', 'FAMCONC']].plot(kind='hist',
bins=10,
sharex=True, sharey=False,
legend=False,
subplots=True,
color="tab:blue",
figsize=(5,12))
axs[0].legend(["Applied Mathematics Tasks at School"])
axs[1].legend(["Pure Mathematics Tasks at School"])
axs[2].legend(["Mathematical Concepts"])
axs[3].legend(["Mathematical Concepts (Signal Detection Adjusted)"]);
plt.suptitle(t="Distribution of Experience & Familiarity Indices", fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=.92, hspace=.2);
The experience and familiarity indices all appear roughly normally distributed, except for experience with pure mathematics tasks at school.
axs= pisa2012[['TCHBEHTD', 'TCHBEHFA', 'TCHBEHSO']].plot(kind='hist', bins=10,
subplots=True,
legend=False,
sharex=True, sharey=False,
color="tab:blue",
figsize=(5,12)
)
# Legends follow the column order: TCHBEHTD, TCHBEHFA, TCHBEHSO
axs[0].legend(["Teacher-directed Instruction"])
axs[1].legend(["Formative Assessment"])
axs[2].legend(["Student Orientation"]);
plt.suptitle(t="Distribution of Teacher Behavior", fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=.92, hspace=.2);
The teacher behavior indices all appear roughly normally distributed.
axs = pisa2012[['TEACHSUP', 'COGACT', 'MTSUP', 'CLSMAN', 'DISCLIMA']].plot(kind='hist',
bins=8,legend=False,
subplots=True,
sharex=True, sharey=False,
color="tab:blue",
figsize=(5,12)
)
axs[0].legend(["Teacher Support"])
axs[1].legend(["Cognitive Activation in Mathematics Lessons"])
axs[2].legend(["Mathematics Teacher's Support"])
axs[3].legend(["Mathematics Teacher's Classroom Management"])
axs[4].legend(["Disciplinary Climate"])
plt.suptitle(t="Distribution of Teaching quality", fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=.92, hspace=.2);
The teaching quality indices seem to be roughly normally distributed.
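The normality statements above are read off the histograms by eye. One way to back them with numbers is to look at skewness and excess kurtosis, both of which are close to 0 for a normal shape. The sketch below uses synthetic columns (`symmetric_index` and `skewed_index` are made-up names, not PISA variables) to show the idea; the same two calls could be applied to the index columns plotted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for index columns (not real PISA data).
toy = pd.DataFrame({
    "symmetric_index": rng.normal(loc=0.0, scale=1.0, size=5000),
    "skewed_index": rng.exponential(scale=1.0, size=5000),
})

# Skewness near 0 and excess kurtosis near 0 are consistent with a normal shape;
# large positive skewness points to a long right tail.
shape = pd.DataFrame({"skew": toy.skew(), "excess_kurtosis": toy.kurt()})
print(shape.round(2))
```

Both methods skip missing values by default, which matters for the PISA indices because many of them are only available for part of the sample.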
pisa2012['ST61Q09'].unique()
experience_and_familiarity_order = ['Frequently', 'Sometimes', 'Rarely ', 'Never ']
transform_categorical_column_to_be_ordered(pisa2012, experience_and_familiarity_order,
['ST61Q01', 'ST61Q02', 'ST61Q03', 'ST61Q04',
'ST61Q05', 'ST61Q06', 'ST61Q07','ST61Q08', 'ST61Q09'])
col = ['ST61Q01', 'ST61Q02', 'ST61Q03', 'ST61Q04', 'ST61Q06', 'ST61Q08']
titles = ["Use <Train Timetable>", "Calculate Price including Tax", "Calculate Square Metres",
"Understand Scientific Tables", 'Use a Map to Calculate Distance', "Calculate Power Consumption Rate"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for applied mathematics tasks at school", titles,
nr=2, nc=3, figh=19, figw=10, color=["tab:blue","tab:orange","tab:blue","tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,4,5]:
axes[i].set_yticks([],[]);
Except for Calculate power consumption rate, the majority of students report that they sometimes did the following applied mathematics tasks at school: use a train timetable, calculate a price including tax, understand scientific tables and use a map to calculate distance.
col = ['ST61Q05', 'ST61Q07', 'ST61Q09']
titles = ["Solve Equation 1", "Solve Equation 2", "Solve Equation 3"]
axes = summary_barplot_items(pisa2012, col,
"Summary of the parameters for Experience with Pure Maths Tasks", titles,
nr=1, nc=3, figh=12, figw=5, color=["tab:blue"],
vert=True, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2]:
axes[i].set_yticks([],[]);
More than 50% of students report that they frequently solve equations of types 1, 2 and 3 at school.
pisa2012['ST62Q01'].unique()
Familiarity_with_math_order = ['Never heard of it', 'Heard of it once or twice', 'Heard of it a few times',
'Heard of it often', 'Know it well, understand the concept']
familiarity_cols = ['ST62Q01', 'ST62Q02', 'ST62Q03', 'ST62Q06', 'ST62Q07', 'ST62Q08', 'ST62Q09', 'ST62Q10',
'ST62Q12', 'ST62Q15', 'ST62Q16', 'ST62Q17', 'ST62Q19', 'ST62Q04', 'ST62Q11', 'ST62Q13']
transform_categorical_column_to_be_ordered(pisa2012, Familiarity_with_math_order, familiarity_cols)
col = familiarity_cols
titles = ["Exponential Function", "Divisor", "Quadratic Function", "Linear Equation", "Vectors", "Complex number",
"Rational number", 'Radicals', "Polygon", "Congruent Figure", "Cosine", "Arithmetic Mean",
"Probability", "Proper Number", "Subjunctive Scaling", "Declarative Function"]
axes = summary_barplot_items(pisa2012, col,
"Summary of the parameters for Familiarity with Math concepts", titles,
nr=4, nc=4, figh=19, figw=10, color=["tab:blue"],
vert=True, fig_a_hspace=.9, fig_a_wspace=.2)
for i in [1,2,3,5,6,7,9,10,11,13,14,15]:
axes[i].set_yticks([],[]);
For familiarity with maths concepts, a high percentage of students answered Know it well, understand the concept for: divisor, quadratic function, linear equation, rational number, radicals, polygon, cosine and probability.
pisa2012['ST79Q11'].unique()
teacher_behaviour_order = ['Never or Hardly Ever', 'Some Lessons', 'Most Lessons', 'Every Lesson']
teacher_behaviour_cols = ['ST79Q01', 'ST79Q02', 'ST79Q06', 'ST79Q08', 'ST79Q15', 'ST79Q03', 'ST79Q04',
'ST79Q07', 'ST79Q10', 'ST79Q05', 'ST79Q11', 'ST79Q12', 'ST79Q17']
transform_categorical_column_to_be_ordered(pisa2012, teacher_behaviour_order, teacher_behaviour_cols)
col = ['ST79Q01', 'ST79Q02', 'ST79Q06', 'ST79Q08', 'ST79Q15']
titles = ["Sets Clear Goals", "Encourages Thinking and Reasoning", "Checks Understanding", "Summarizes Previous Lessons",
"Informs about Learning Goals"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Teacher Behavior", titles,
nr=1, nc=5,figh=19, figw=5, color=["tab:red", "tab:blue", "tab:blue", "tab:blue"])
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
Fewer than 10% of students answered Never or hardly ever for each teacher behavior item: sets clear goals, encourages thinking and reasoning, checks understanding, summarizes previous lessons and informs about learning goals.
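Statements such as "fewer than 10%" are read off the bar charts. A small sketch of how such shares can be computed directly with `value_counts(normalize=True)` is given below; the series `answers` holds invented values and does not come from the PISA data.

```python
import pandas as pd

# Toy answers standing in for one questionnaire item (not real PISA data).
answers = pd.Series(
    ["Every Lesson", "Most Lessons", "Most Lessons", "Some Lessons",
     "Never or Hardly Ever", "Every Lesson", "Most Lessons", None,
     "Some Lessons", "Every Lesson", "Most Lessons"],
    name="toy_item",
)

# normalize=True returns shares of the non-missing answers instead of counts.
share = answers.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Missing answers are dropped before the shares are computed, so the percentages refer to students who answered the item, not to the whole sample.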
col = ['ST79Q03', 'ST79Q04', 'ST79Q07', 'ST79Q10']
titles = ["Differentiates Between Students When Giving Tasks",
"Assigns Complex Projects", "Has Students Work in Small Groups",
"Plans Classroom Activities"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Student Orientation", titles,
nr=2, nc=2,figh=12, figw=5)
for i in [1,3]:
axes[i].set_yticks([],[]);
More than 40% of students report that their teacher never or hardly ever uses the following student-orientation practices: differentiates between students when giving tasks, assigns complex projects, has students work in small groups and plans classroom activities.
col = ['ST79Q05', 'ST79Q11', 'ST79Q12', 'ST79Q17']
titles = ["Gives Feedback",
"Gives Feedback on Strengths and Weaknesses", "Informs about Expectations",
"Tells How to Get Better"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Formative Assessment", titles,
nr=2, nc=2,figh=12, figw=5)
for i in [1,3]:
axes[i].set_yticks([],[]);
There is no clear indication of whether any of the teacher's formative assessment practices helps students to get better.
pisa2012['ST77Q06'].unique()
teacher_support_order = ['Never or Hardly Ever', 'Some Lessons', 'Most Lessons', 'Every Lesson']
transform_categorical_column_to_be_ordered(pisa2012, teacher_support_order,
["ST77Q01", "ST77Q02", "ST77Q04", "ST77Q05", "ST77Q06"])
col = ['ST77Q01', 'ST77Q02', 'ST77Q04', 'ST77Q05', "ST77Q06"]
titles = ["Teacher shows interest", "Extra help", "Teacher helps", "Teacher continues", "Express opinions"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Math Teaching", titles,
nr=1, nc=5,figh=19, figw=5)
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
Fewer than 10% of students report that their teacher never or hardly ever supports them, while more than 30% report support on the following items: teacher shows interest, extra help, teacher helps, teacher continues and express opinions.
pisa2012['ST80Q01'].unique()
cognitive_activation_order = ['Never or rarely', 'Sometimes', 'Often', 'Always or almost always']
transform_categorical_column_to_be_ordered(pisa2012, cognitive_activation_order,
["ST80Q01", "ST80Q04", "ST80Q05",
"ST80Q06", "ST80Q07", "ST80Q08",
"ST80Q09", "ST80Q10", "ST80Q11"])
col = ["ST80Q01", "ST80Q04", "ST80Q05", "ST80Q06", "ST80Q07", "ST80Q08", "ST80Q09", "ST80Q10", "ST80Q11"]
titles = ["Teacher Encourages to Reflect Problems",
"Gives Problems that Require to Think",
"Asks to Use Own Procedures", "Presents Problems with No Obvious Solutions",
"Presents Problems in Different Contexts", "Helps Learn from Mistakes", "Asks for Explanations",
"Apply What We Learned", "Problems with Multiple Solutions"
]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Cognitive Activation", titles,
nr=3, nc=3,figh=19, figw=10, fig_a_hspace=.8)
for i in [1,2,4,5,7,8]:
axes[i].set_yticks([],[]);
It's not clear how cognitive activation helps to understand student academic performance.
pisa2012['ST83Q01'].unique()
math_teacher_sup_order = ['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, math_teacher_sup_order,
["ST83Q01", "ST83Q02", "ST83Q03", "ST83Q04"])
col = ["ST83Q01", "ST83Q02", "ST83Q03", "ST83Q04"]
titles = ["Lets Us Know We Have to Work Hard",
"Provides Extra Help When Needed",
"Helps Students with Learning", "Gives Opportunity to Express Opinions"
]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Teacher support", titles,
nr=1, nc=4,figh=19, figw=5)
for i in [1,2,3]:
axes[i].set_yticks([],[]);
More than 40% of students answered Agree for each teacher support item: lets us know we have to work hard, provides extra help when needed, helps students with learning and gives opportunity to express opinions.
pisa2012['ST85Q01'].unique()
clsman_order =['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, clsman_order, ["ST85Q01", "ST85Q02", "ST85Q03", "ST85Q04"])
col = ["ST85Q01", "ST85Q02", "ST85Q03", "ST85Q04"]
titles = ["Students Listen",
"Teacher Keeps Class Orderly",
"Teacher Starts On Time", "Wait Long to <Quiet Down>"
]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Classroom management", titles,
nr=1, nc=4,figh=19, figw=5)
for i in [1,2,3]:
axes[i].set_yticks([],[]);
Except for the item Wait long to quiet down, more than 40% of students believe that their teacher's classroom management can be pictured as follows: students listen, the teacher keeps the class orderly and the teacher starts on time.
pisa2012['ST81Q01'].unique()
disclima_order = ['Never or Hardly Ever', 'Some Lessons', 'Most Lessons', 'Every Lesson']
transform_categorical_column_to_be_ordered(pisa2012, disclima_order,
['ST81Q01', 'ST81Q02', 'ST81Q03', 'ST81Q04', 'ST81Q05'])
col = ['ST81Q01', 'ST81Q02', 'ST81Q03', 'ST81Q04', 'ST81Q05']
titles = ["Students Don’t Listen",
"Noise and Disorder",
"Teacher Has to Wait Until its Quiet", "Students Don’t Work Well",
"Students Start Working Late"
]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Disciplinary climate", titles,
nr=1, nc=5,figh=19, figw=5, color=["tab:blue", "tab:green", "tab:blue", "tab:blue"])
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
More than 40% of students think the disciplinary climate depends on the lesson.
According to its definition, school climate is described by the following indices:
axes = pisa2012[['STUDREL', 'BELONG']].plot(kind='hist', bins=10, sharex=True,
subplots=True, figsize=(8,8), legend=False, color="tab:blue")
axes[0].legend(["Teacher Student Relations"])
axes[1].legend(["Sense of Belonging to School"]);
plt.suptitle(t="Distribution of indices for School climate", fontsize=20, fontweight='bold', color="tab:blue");
The indices for school climate seem to be roughly normally distributed.
pisa2012['ST86Q01'].unique()
studrel_order = ['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, studrel_order,
["ST86Q01", "ST86Q02", "ST86Q03", "ST86Q04", "ST86Q05"])
col = ['ST86Q01', 'ST86Q02', 'ST86Q03', 'ST86Q04', 'ST86Q05']
titles = ["Get Along with Teachers",
"Teachers Are Interested",
"Teachers Listen to Students", "Teachers Help Students",
"Teachers Treat Students Fair"
]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Student-Teacher Relation", titles,
nr=1, nc=5,figh=19, figw=5)
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
More than 60% of students answered Agree for each student-teacher relation item: get along with teachers, teachers are interested, teachers listen to students, teachers help students and teachers treat students fairly.
pisa2012["ST87Q01"].unique()
belong_order = ['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, belong_order,
["ST87Q01", "ST87Q02", "ST87Q03", "ST87Q04", "ST87Q05",
"ST87Q06", "ST87Q07", "ST87Q08", "ST87Q09"])
col = ["ST87Q01", "ST87Q02", "ST87Q03", "ST87Q04", "ST87Q05", "ST87Q06", "ST87Q07", "ST87Q08", "ST87Q09"]
titles = ["Feel Like Outsider", "Make Friends Easily", "Belong at School", "Feel Awkward at School",
"Liked by Other Students", "Feel Lonely at School", "Feel Happy at School", "Things Are Ideal at School",
"Satisfied at School"]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Sense of Belonging", titles,
nr=2, nc=5,figh=19, figw=10, fig_a_hspace=.4)
for i in [1,2,3,4, 6,7,8,9]:
axes[i].set_yticks([],[]);
plt.delaxes(axes[9]);
Attitudes towards school are covered by two scaled indices based on eight items from ST88 and ST89.
axes = pisa2012[['ATSCHL', 'ATTLNACT']].plot(kind='hist',
bins=8, figsize=(8,8),
subplots=True,
sharex=False, color="tab:blue")
axes[0].legend(["Learning Outcomes"])
axes[1].legend(["Learning Activities"])
plt.suptitle(t="Distribution of indices for Attitudes towards school",
fontsize=20, fontweight='bold', color="tab:blue");
pisa2012['ST88Q01'].unique()
atschl_order = ['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, atschl_order,
["ST88Q01", "ST88Q02", "ST88Q03", "ST88Q04"])
col = ["ST88Q01", "ST88Q02", "ST88Q03", "ST88Q04"]
titles = ["Does Little to Prepare Me for Life",
"Waste of Time",
"Gave Me Confidence",
"Useful for Job"
]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Learning Outcomes", titles,
nr=1, nc=4,figh=19, figw=5)
for i in [1,2,3]:
axes[i].set_yticks([],[]);
pisa2012['ST89Q02'].unique()
attlnact_order = ['Strongly disagree', 'Disagree', 'Agree', 'Strongly agree']
transform_categorical_column_to_be_ordered(pisa2012, attlnact_order,
["ST89Q02", "ST89Q03", "ST89Q04", "ST89Q05"])
col = ["ST89Q02", "ST89Q03", "ST89Q04", "ST89Q05"]
titles = ["Helps to Get a Job",
"Prepare for College",
"Enjoy Good Grades",
"Trying Hard is Important"
]
axes = summary_barplot_items(pisa2012, col, "Summary of the parameters for Learning Activities", titles,
nr=1, nc=4,figh=19, figw=5)
for i in [1,2,3]:
axes[i].set_yticks([],[]);
### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?
From the univariate explorations of the indices, we found the following:
There are more students from OECD countries. The relative grade distribution seems to follow a normal distribution. Many parents have completed ISCED level 3A, 3B or 3C.
Parents possess more literature books per household.
Some parameters show students' attitudes towards mathematics, but we can't really select the one that qualifies best.
Teacher behavior seems good, or at least teachers occasionally get good feedback from their students.
No real pattern was observed in students' behavior at school.
### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
There were some useful features, both among the general indices and among the parameters that compose them. Yes, we changed the data type of some columns from numerical to categorical in order to better understand the data.
grid = sb.PairGrid(data = pisa2012.sample(20000),
vars=["Math", "Reading", "Science"])
grid = grid.map_diag(plt.hist, linewidth=3)
grid = grid.map_lower(sb.kdeplot, linewidths = 2, edgecolor = 'blue', alpha =.5)
grid = grid.map_upper(plt.scatter, linewidths = 2, edgecolor = 'yellow',alpha =.7)
grid.fig.suptitle("Math, Reading, Science score scatter plot", y = 1.02,
fontsize=14, fontweight='bold');
binsize = 10
bins = np.arange(0, pisa2012['PV1MATH'].max()+binsize, binsize)
(pisa2012
.query("ST04Q01 == 'Female' ")
['Math']
.plot(kind='hist',
bins=bins,
title="Distribution of Maths Score by Gender",
figsize=(8,5),
alpha=.5,
label="Female"
)
);
(pisa2012
.query("ST04Q01 == 'Male' ")
['Math']
.plot(kind='hist',
bins=bins,
title="Distribution of Maths Score by Gender",
figsize=(8,5),
alpha=.5,
label='Male'
)
);
plt.xlabel('Maths Score');
plt.ylabel('Number of Students');
plt.legend();
binsize = 10
bins = np.arange(0, pisa2012['PV1READ'].max()+binsize, binsize)
(pisa2012
.query("ST04Q01 == 'Female' ")
['PV1READ']
.plot(kind='hist',
bins=bins,
title="Distribution of Reading Score by Gender",
figsize=(8,5),
alpha=.5,
label="Female"
)
);
(pisa2012
.query("ST04Q01 == 'Male' ")
['PV1READ']
.plot(kind='hist',
bins=bins,
title="Distribution of Reading Score by Gender",
figsize=(8,5),
alpha=.5,
label='Male'
)
);
plt.xlabel('Reading Score');
plt.ylabel('Number of Students');
plt.legend();
binsize = 10
bins = np.arange(0, pisa2012['PV1SCIE'].max()+binsize, binsize)
(pisa2012
.query("ST04Q01 == 'Female' ")
['PV1SCIE']
.plot(kind='hist',
bins=bins,
title="Distribution of Science Score by Gender",
figsize=(8,5),
alpha=.5,
label="Female"
)
);
(pisa2012
.query("ST04Q01 == 'Male' ")
['PV1SCIE']
.plot(kind='hist',
bins=bins,
title="Distribution of Science Score by Gender",
figsize=(8,5),
alpha=.5,
label='Male'
)
);
plt.xlabel('Science Score');
plt.ylabel('Number of Students');
plt.legend();
(pisa2012
.groupby('CNT')['PV1MATH']  # select the score column first so only it is averaged
.mean()
.sort_values(ascending=True)
.plot(kind='barh',
title="Mean Math Score by Country (red line represents mean)",
figsize=(8,6), color='C0'
)
)
plt.xlabel('Mean Math Score')
plt.ylabel('Country')
plt.axvline(pisa2012['PV1MATH'].mean(), color='r')
plt.show()
(pisa2012
.groupby('CNT')['PV1READ']  # select the score column first so only it is averaged
.mean()
.sort_values(ascending=True)
.plot(kind='barh',
title="Mean Reading Score by Country (red line represents mean)",
figsize=(8,6), color='C0'
)
)
plt.xlabel('Mean Reading Score')
plt.ylabel('Country')
plt.axvline(pisa2012['PV1READ'].mean(), color='r');
(pisa2012
.groupby('CNT')['PV1SCIE']  # select the score column first so only it is averaged
.mean()
.sort_values(ascending=True)
.plot(kind='barh',
title="Mean Science Score by Country (red line represents mean)",
figsize=(8,6), color='C0'
)
)
plt.xlabel('Mean Science Score')
plt.ylabel('Country')
plt.axvline(pisa2012['PV1SCIE'].mean(), color='r');
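The red line in the three charts above is the mean over all students, so countries with larger samples pull it more strongly than small ones. The toy example below (two hypothetical countries `A` and `B` with invented scores) sketches how this student-level mean can differ from the mean of the country means, which weights every country equally.

```python
import pandas as pd

# Toy scores for two hypothetical countries of unequal sample size.
toy = pd.DataFrame({
    "CNT": ["A", "A", "A", "A", "B"],
    "PV1MATH": [400.0, 420.0, 440.0, 460.0, 600.0],
})

# Mean score per country, sorted as in the bar charts.
country_means = toy.groupby("CNT")["PV1MATH"].mean().sort_values()

student_level_mean = toy["PV1MATH"].mean()    # what the red line shows
mean_of_country_means = country_means.mean()  # weights each country equally

print(country_means)
print(student_level_mean, mean_of_country_means)
```

Which reference line is more appropriate depends on the question being asked; neither of the two uses the PISA sampling weights.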
(pd.crosstab(index=pisa2012['CNT'], columns=pisa2012['ST04Q01'])
.sort_values(by=['Female', 'Male'], ascending=True)
.plot(kind='barh',
figsize=(12,10),
)
);
plt.legend(title="Sex")
plt.xlabel('Number of Students');
plt.ylabel("Country");
binsize = 10
bins = np.arange(0, pisa2012['Math'].max()+binsize, binsize)
(pisa2012
.query("OECD == 'OECD' ")
['Math']
.plot(kind='hist',
bins=bins,
title="Distribution of Maths Score by OECD",
figsize=(8,5),
alpha=.5,
label="OECD"
)
);
(pisa2012
.query("OECD == 'Non-OECD' ")
['Math']
.plot(kind='hist',
bins=bins,
title="Distribution of Maths Score by OECD",
figsize=(8,5),
alpha=.5,
label='Non-OECD'
)
);
plt.xlabel('Maths Score');
plt.ylabel('Number of Students');
plt.legend();
binsize = 10
bins = np.arange(0, pisa2012['Reading'].max()+binsize, binsize)
(pisa2012
.query("OECD == 'OECD' ")
['Reading']
.plot(kind='hist',
bins=bins,
title="Distribution of Reading Score by OECD",
figsize=(8,5),
alpha=.5,
label="OECD"
)
);
(pisa2012
.query("OECD == 'Non-OECD' ")
['Reading']
.plot(kind='hist',
bins=bins,
title="Distribution of Reading Score by OECD",
figsize=(8,5),
alpha=.5,
label='Non-OECD'
)
);
plt.xlabel('Reading Score');
plt.ylabel('Number of Students');
plt.legend();
binsize = 10
bins = np.arange(0, pisa2012['Science'].max()+binsize, binsize)
(pisa2012
.query("OECD == 'OECD' ")
['Science']
.plot(kind='hist',
bins=bins,
title="Distribution of Science Score by OECD",
figsize=(8,5),
alpha=.5,
label="OECD"
)
);
(pisa2012
.query("OECD == 'Non-OECD' ")
['Science']
.plot(kind='hist',
bins=bins,
title="Distribution of Science Score by OECD",
figsize=(8,5),
alpha=.5,
label='Non-OECD'
)
);
plt.xlabel('Science Score');
plt.ylabel('Number of Students');
plt.legend();
crosstab_mother = pd.crosstab(pisa2012['GRADE'], pisa2012['ST11Q01'], normalize=True).round(4)
crosstab_mother.plot(kind='barh', color=['tab:orange', "tab:blue"], figsize=(12,5), table=True);
plt.legend(title="Mother at home");
plt.xticks([],[]);
plt.title("Percentage of grade distribution with mother home or not");
crosstab_father = pd.crosstab(pisa2012['GRADE'], pisa2012['ST11Q02'], normalize=True,).round(4)
crosstab_father.plot(kind='barh', color=['tab:orange', "tab:blue"], figsize=(12,5), table=True);
plt.legend(title="Father at home");
plt.xticks([],[]);
plt.title("Percentage of grade distribution with father home or not");
crosstab_immig = pd.crosstab(pisa2012['GRADE'],pisa2012['IMMIG'], normalize=True,).round(4)
crosstab_immig.plot(kind='barh', color=["tab:blue", "tab:orange", "tab:green"], figsize=(12,5), table=True);
plt.legend(title="Immigration status");
plt.xticks([],[]);
plt.title("Percentage of grade distribution by immigration status");
crosstab_sex = pd.crosstab(pisa2012['GRADE'], pisa2012['ST04Q01'], normalize=True,).round(4)
crosstab_sex.plot(kind='barh', color=['tab:orange', "tab:blue"],figsize=(12,5), table=True);
plt.legend(title="Sex");
plt.xticks([],[]);
plt.title("Percentage of grade distribution according to the sex");
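The crosstabs above use `normalize=True`, which divides every cell by the grand total, so the percentages describe shares of all students. The sketch below contrasts this with the `"index"` and `"columns"` options of `pd.crosstab` on a tiny invented table; the column name `SEX` and the values are assumptions for illustration.

```python
import pandas as pd

# Tiny invented table: three students in grade 0, one in grade 1.
toy = pd.DataFrame({
    "GRADE": [0, 0, 0, 1],
    "SEX": ["Female", "Male", "Male", "Female"],
})

# All cells together sum to 1.
overall = pd.crosstab(toy["GRADE"], toy["SEX"], normalize=True)
# Each row (grade) sums to 1.
by_grade = pd.crosstab(toy["GRADE"], toy["SEX"], normalize="index")
# Each column (sex) sums to 1.
by_sex = pd.crosstab(toy["GRADE"], toy["SEX"], normalize="columns")

print(overall, by_grade, by_sex, sep="\n\n")
```

Row-wise normalization is often easier to read when the groups differ a lot in size, because a small group is otherwise reduced to very short bars.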
crosstab_oecd_sex = pd.crosstab(pisa2012['OECD'], pisa2012['ST04Q01'], normalize=True,)
crosstab_oecd_sex.plot(kind='bar', color=['tab:orange', "tab:blue"], rot=0);
plt.legend(title="Sex");
for i,v in enumerate(crosstab_oecd_sex["Female"]):
plt.text(i, v, str(round(v, 4)), fontdict=dict(color='black', fontsize=12), ha='right');
for i,v in enumerate(crosstab_oecd_sex["Male"]):
plt.text(i, v, str(round(v, 4)), fontdict=dict(color='black', fontsize=12), ha='left');
plt.ylabel("Percentage");
plt.title("Percentage of sex distribution according to the OECD countries");
crosstab_book_oecd = pd.crosstab(pisa2012['ST28Q01'],pisa2012['OECD'], normalize=True,)
crosstab_book_oecd.plot(kind='barh', color=['tab:orange', "tab:blue"],);
plt.legend(title="");
for i,v in enumerate(crosstab_book_oecd["Non-OECD"]):
plt.text(v, i, str(round(v, 4)), fontdict=dict(color='black', fontsize=12), va='top');
for i,v in enumerate(crosstab_book_oecd["OECD"]):
plt.text(v, i, str(round(v, 4)), fontdict=dict(color='black', fontsize=12), va='bottom');
plt.xlabel("Percentage");
plt.ylabel("");
plt.title("Percentage of books distribution according to the OECD countries");
fig = plt.figure(figsize =(30,15))
ax1 = fig.add_subplot(231)
sb.barplot(data=pisa2012, y="GRADE", x="BFMJ2", ci='sd', color="tab:blue", orient='h');
plt.xlabel("Father Highest Occupational Status");
plt.ylabel("GRADE");
ax2 = fig.add_subplot(232)
sb.barplot(data=pisa2012, y="GRADE", x="BMMJ1", ci='sd', color="tab:blue", orient='h');
plt.xlabel("Mother Highest Occupational Status");
plt.ylabel("GRADE");
ax3 = fig.add_subplot(233)
sb.barplot(data=pisa2012, y="GRADE", x="HISEI", ci='sd', color="tab:blue", orient='h');
plt.xlabel("Parents Highest Occupational Status");
plt.ylabel("GRADE");
ax4 = fig.add_subplot(234)
crosstab_grade_fisced = pd.crosstab(pisa2012['GRADE'],pisa2012['FISCED'], normalize=True,).round(4)
crosstab_grade_fisced.plot(kind='barh', table=False, ax=ax4);
ax4.set_xlabel("Percentage")
plt.legend(title="Educational level of father (ISCED)");
ax5 = fig.add_subplot(235)
crosstab_grade_misced = pd.crosstab(pisa2012['GRADE'],pisa2012['MISCED'], normalize=True,).round(4)
crosstab_grade_misced.plot(kind='barh', table=False, ax=ax5);
ax5.set_xlabel("Percentage")
plt.legend(title="Educational level of Mother (ISCED)");
ax6 = fig.add_subplot(236)
crosstab_grade_hisced = pd.crosstab(pisa2012['GRADE'],pisa2012['HISCED'], normalize=True,).round(4)
crosstab_grade_hisced.plot(kind='barh', table=False, ax=ax6);
ax6.set_xlabel("Percentage")
plt.legend(title="Educational level of parents (ISCED)");
plt.subplots_adjust(wspace=0.4);
def summary_hue_barplot_items(df:pd.DataFrame, col:list, hue_col:str, suptitle:str, title:list,
                              nr=1, nc=1,
                              figh=5, figw=12,
                              fig_a_top=.8, fig_a_wspace=.2, fig_a_hspace=.9,
                              color=["tab:blue"], vert=True):
    """
    Draw one count plot per column in `col`, split by `hue_col`, on a grid of nr x nc axes.
    `title` holds one title per column. `figh` is passed as the figure width and `figw` as
    the figure height. `color` is currently unused. Returns the flattened array of axes.
    """
    fig, axes = plt.subplots(nrows=nr, ncols=nc, figsize=(figh, figw))
    fig.suptitle(t=suptitle, x = 0.5, y = 0.95, fontsize = 20, fontweight='bold', color = 'tab:blue')
    # atleast_1d keeps the default nr=1, nc=1 case working, where plt.subplots returns a single Axes
    axes_ = np.atleast_1d(axes).flatten()
    for idx, c in enumerate(col):
        if vert:
            # vertical bars: categories on the x axis
            ax = sb.countplot(data=df, x=c, hue=hue_col, ax=axes_[idx])
            ax.set_xlabel("");
        else:
            # horizontal bars: categories on the y axis
            ax = sb.countplot(data=df, y=c, hue=hue_col, ax=axes_[idx])
            ax.set_ylabel("");
        # legends are removed here; the caller adds a single legend afterwards
        ax.get_legend().remove()
        # one title per subplot (the vertical branch previously passed the whole list)
        ax.set_title(title[idx]);
    fig.subplots_adjust(top=fig_a_top, wspace=fig_a_wspace , hspace=fig_a_hspace);
    return axes_
col = ['ST29Q01', 'ST29Q03', 'ST29Q04', 'ST29Q06']
titles = ["Enjoy Reading", "Look Forward to Lessons", "Enjoy Maths", "Interested"]
axes = summary_hue_barplot_items(pisa2012, col, 'ST04Q01',
"Summary of the parameters for Mathematics Interest VS Sex", titles,
nr=1, nc=4,figh=15, figw=5, color=["tab:blue"],
vert=False, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3]:
axes[i].set_yticks([],[]);
axes[3].legend(title="Sex",bbox_to_anchor=(1.4, 1));
col = ['ST44Q01', 'ST44Q03', 'ST44Q04', 'ST44Q05', 'ST44Q07', 'ST44Q08']
titles = ["Not Good at Maths Problems", "Teacher Did Not Explain Well", "Bad Guesses",
"Material Too Hard", 'Teacher Didnt Get Students Interested', 'Unlucky']
axes = summary_hue_barplot_items(pisa2012, col, "ST04Q01",
"Summary of the parameters for Failure in mathematics VS Sex", titles,
nr=2, nc=3, figh=19, figw=10,
vert=False, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,4,5]:
axes[i].set_yticks([],[]);
axes[2].legend(title="Sex",bbox_to_anchor=(1.4, 0));
col = ['ST46Q01', 'ST46Q02', 'ST46Q03', 'ST46Q04', 'ST46Q05', 'ST46Q06', 'ST46Q07', 'ST46Q08', 'ST46Q09']
titles = ["Homework Completed in Time", "Work Hard on Homework", "Prepared for Exams",
"Study Hard for Quizzes", 'Study Until I Understand Everything', "Pay Attention in Classes",
'Listen in Classes', "Avoid Distractions When Studying", "Keep Work Organized"]
axes = summary_hue_barplot_items(pisa2012, col, "ST04Q01","Summary of the parameters for Mathematics work ethic VS Sex", titles,
nr=2, nc=5, figh=19, figw=10,
vert=False, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3,4,6,7,8,9]:
axes[i].set_yticks([],[]);
plt.delaxes(axes[9]);
axes[4].legend(title="Sex",bbox_to_anchor=(1.5, 0));
col = ['ST49Q01', 'ST49Q02', 'ST49Q03', 'ST49Q04', 'ST49Q05', 'ST49Q06', 'ST49Q07', 'ST49Q09']
titles = ["Talk about Maths with Friends", "Help Friends with Maths", "<Extracurricular> Activity",
"Participate in Competitions", 'Study More Than 2 Extra Hours a Day', "Play Chess",
'Computer programming', "Participate in Math Club"]
axes = summary_hue_barplot_items(pisa2012, col, "ST04Q01","Summary of the parameters for Mathematics behavior VS Sex", titles,
nr=2, nc=4, figh=19, figw=10,
vert=False, fig_a_hspace=.4, fig_a_wspace=.2)
for i in [1,2,3,5,6,7]:
axes[i].set_yticks([],[]);
axes[3].legend(title="Sex",bbox_to_anchor=(1.5, 0));
col = ['ST79Q01', 'ST79Q02', 'ST79Q06', 'ST79Q08', 'ST79Q15']
titles = ["Sets Clear Goals", "Encourages Thinking and Reasoning", "Checks Understanding", "Summarizes Previous Lessons",
"Informs about Learning Goals"]
axes = summary_hue_barplot_items(pisa2012, col,
"ST04Q01", "Summary of the parameters for Teacher Behavior VS Sex",
titles, vert=False,
nr=1, nc=5,figh=19, figw=5)
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
axes[4].legend(title="Sex",bbox_to_anchor=(1.5, 1));
col = ['ST81Q01', 'ST81Q02', 'ST81Q03', 'ST81Q04', 'ST81Q05']
titles = ["Students Don’t Listen",
"Noise and Disorder",
"Teacher Has to Wait Until its Quiet", "Students Don’t Work Well",
"Students Start Working Late"
]
axes = summary_hue_barplot_items(pisa2012, col, "ST04Q01",
"Summary of the parameters for Disciplinary climate VS Sex", titles,
nr=1, nc=5,figh=19, figw=5, vert=False)
for i in [1,2,3,4]:
axes[i].set_yticks([],[]);
axes[4].legend(title="Sex",bbox_to_anchor=(1.5, 1));
col = ["ST89Q02", "ST89Q03", "ST89Q04", "ST89Q05"]
titles = ["Helps to Get a Job",
"Prepare for College",
"Enjoy Good Grades",
"Trying Hard is Important"
]
axes = summary_hue_barplot_items(pisa2012, col, "ST04Q01",
"Summary of the parameters for Learning Activities VS Sex", titles, vert=False,
nr=1, nc=4,figh=19, figw=5)
for i in [1,2,3]:
axes[i].set_yticks([],[]);
axes[3].legend(title="Sex",bbox_to_anchor=(1.5, 1));
fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(141)
pd.crosstab(pisa2012['ST89Q02'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Helps to Get a Job",
stacked=False, rot=0,
ax=ax1, legend=False);
ax1.set_ylabel("")
ax1.set_xlabel("Percentage")
ax2 = fig.add_subplot(142)
pd.crosstab(pisa2012['ST89Q03'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Prepare for College",
stacked=False,
ax=ax2, legend=False);
ax2.set_ylabel("")
ax2.set_yticks([],[])
ax2.set_xlabel("Percentage")
ax3 = fig.add_subplot(143)
pd.crosstab(pisa2012['ST89Q04'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title= "Enjoy Good Grades",
stacked=False, rot=15,
ax=ax3, legend=False);
ax3.set_ylabel("")
ax3.set_yticks([],[])
ax3.set_xlabel("Percentage")
ax4 = fig.add_subplot(144)
pd.crosstab(pisa2012['ST89Q05'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Trying Hard is Important",
stacked=False, rot=15, ax=ax4);
ax4.legend(title="GRADE",bbox_to_anchor=(1.5, 1))
ax4.set_ylabel("");
ax4.set_xlabel("Percentage")
ax4.set_yticks([],[]);
fig.suptitle(t="Summary of the parameters for Learning Activities VS Grade",
x = 0.5, y = 0.95, fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=0.8);
ax4.legend(title="GRADE",bbox_to_anchor=(1.5, 1));
fig = plt.figure(figsize=(30,15))
ax1 = fig.add_subplot(241)
pd.crosstab(pisa2012['ST81Q01'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Students Don’t Listen",
stacked=False, rot=0,
ax=ax1, legend=False);
ax1.set_ylabel("")
ax1.set_xlabel("Percentage")
ax2 = fig.add_subplot(242)
pd.crosstab(pisa2012['ST81Q02'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Noise and Disorder",
stacked=False,
ax=ax2, legend=False);
ax2.set_ylabel("")
ax2.set_yticks([],[])
ax2.set_xlabel("Percentage")
ax3 = fig.add_subplot(243)
pd.crosstab(pisa2012['ST81Q03'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title= "Teacher Has to Wait Until It’s Quiet",
stacked=False, rot=0,
ax=ax3, legend=False);
ax3.set_ylabel("")
ax3.set_yticks([],[])
ax3.set_xlabel("Percentage")
ax4 = fig.add_subplot(244)
pd.crosstab(pisa2012['ST81Q04'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Students Don’t Work Well",
stacked=False, rot=0, ax=ax4, legend=False);
ax4.legend(title="GRADE",bbox_to_anchor=(1.4, 0))
ax4.set_ylabel("");
ax4.set_xlabel("Percentage")
ax4.set_yticks([],[]);
ax4.legend(title="GRADE",bbox_to_anchor=(1.5, 0));
ax5 = fig.add_subplot(245)
pd.crosstab(pisa2012['ST81Q05'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Students Start Working Late",
stacked=False, rot=0, ax=ax5, legend=False);
ax5.set_xlabel("Percentage")
ax5.set_ylabel("");
fig.suptitle(t="Summary of the parameters for Disciplinary Climate VS Grade",
x = 0.5, y = 0.95, fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=0.85, hspace=0.2);
fig = plt.figure(figsize=(30,15))
ax1 = fig.add_subplot(241)
pd.crosstab(pisa2012['ST79Q01'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Sets Clear Goals",
stacked=False, rot=0,
ax=ax1, legend=False);
ax1.set_ylabel("")
ax1.set_xlabel("Percentage")
ax2 = fig.add_subplot(242)
pd.crosstab(pisa2012['ST79Q02'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Encourages Thinking and Reasoning",
stacked=False,
ax=ax2, legend=False);
ax2.set_ylabel("")
ax2.set_yticks([],[])
ax2.set_xlabel("Percentage")
ax3 = fig.add_subplot(243)
pd.crosstab(pisa2012['ST79Q06'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title= "Checks Understanding",
stacked=False, rot=0,
ax=ax3, legend=False);
ax3.set_ylabel("")
ax3.set_yticks([],[])
ax3.set_xlabel("Percentage")
ax4 = fig.add_subplot(244)
pd.crosstab(pisa2012['ST79Q08'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Summarizes Previous Lessons",
stacked=False, rot=0, ax=ax4, legend=False);
ax4.legend(title="GRADE",bbox_to_anchor=(1.4, 0));
ax4.set_ylabel("");
ax4.set_xlabel("Percentage")
ax4.set_yticks([],[]);
ax5 = fig.add_subplot(245)
pd.crosstab(pisa2012['ST79Q15'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Informs about Learning Goals",
stacked=False, rot=0, ax=ax5, legend=False);
ax5.set_xlabel("Percentage")
ax5.set_ylabel("");
fig.suptitle(t="Summary of the parameters for Teacher-Directed Instruction VS Grade",
x = 0.5, y = 0.95, fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=0.85, hspace=0.2);
fig = plt.figure(figsize=(30,15))
ax1 = fig.add_subplot(241)
pd.crosstab(pisa2012['ST49Q01'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Talk about Maths with Friends",
stacked=False, rot=0,
ax=ax1, legend=False);
ax1.set_ylabel("")
ax1.set_xlabel("Percentage")
ax2 = fig.add_subplot(242)
pd.crosstab(pisa2012['ST49Q02'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Help Friends with Maths",
stacked=False,
ax=ax2, legend=False);
ax2.set_ylabel("")
ax2.set_yticks([],[])
ax2.set_xlabel("Percentage")
ax3 = fig.add_subplot(243)
pd.crosstab(pisa2012['ST49Q03'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title= "<Extracurricular> Activity",
stacked=False, rot=0,
ax=ax3, legend=False);
ax3.set_ylabel("")
ax3.set_yticks([],[])
ax3.set_xlabel("Percentage")
ax4 = fig.add_subplot(244)
pd.crosstab(pisa2012['ST49Q04'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Participate in Competitions",
stacked=False, rot=0, ax=ax4, legend=False);
ax4.legend(title="GRADE",bbox_to_anchor=(1.4, 0))
ax4.set_ylabel("");
ax4.set_xlabel("Percentage")
ax4.set_yticks([],[]);
ax5 = fig.add_subplot(245)
pd.crosstab(pisa2012['ST49Q05'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title='Study More Than 2 Extra Hours a Day',
stacked=False, rot=0, ax=ax5, legend=False);
ax5.set_xlabel("Percentage")
ax5.set_ylabel("");
ax6 = fig.add_subplot(246)
pd.crosstab(pisa2012['ST49Q06'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Play Chess",
stacked=False, rot=0, ax=ax6, legend=False);
ax6.set_ylabel("");
ax6.set_xlabel("Percentage")
ax6.set_yticks([],[]);
ax7 = fig.add_subplot(247)
pd.crosstab(pisa2012['ST49Q07'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title='Computer programming',
stacked=False, rot=0, ax=ax7, legend=False);
ax7.set_ylabel("");
ax7.set_xlabel("Percentage")
ax7.set_yticks([],[]);
ax8 = fig.add_subplot(248)
pd.crosstab(pisa2012['ST49Q09'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Participate in Math Club",
stacked=False, rot=0, ax=ax8, legend=False);
ax8.set_ylabel("");
ax8.set_xlabel("Percentage")
ax8.set_yticks([],[]);
fig.suptitle(t="Summary of the parameters for Math Behavior VS Grade",
x = 0.5, y = 0.95, fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=0.85, hspace=0.2);
fig = plt.figure(figsize=(30,15))
ax1 = fig.add_subplot(341)
pd.crosstab(pisa2012['ST46Q01'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Homework Completed in Time",
stacked=False, rot=0,
ax=ax1, legend=False);
ax1.set_ylabel("")
ax1.set_xlabel("Percentage")
ax2 = fig.add_subplot(342)
pd.crosstab(pisa2012['ST46Q02'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Work Hard on Homework",
stacked=False,
ax=ax2, legend=False);
ax2.set_ylabel("")
ax2.set_yticks([],[])
ax2.set_xlabel("Percentage")
ax3 = fig.add_subplot(343)
pd.crosstab(pisa2012['ST46Q03'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title= "Prepared for Exams",
stacked=False, rot=0,
ax=ax3, legend=False);
ax3.set_ylabel("")
ax3.set_yticks([],[])
ax3.set_xlabel("Percentage")
ax4 = fig.add_subplot(344)
pd.crosstab(pisa2012['ST46Q04'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Study Hard for Quizzes",
stacked=False, rot=0, ax=ax4, legend=False);
ax4.legend(title="GRADE",bbox_to_anchor=(1.4, 0))
ax4.set_ylabel("");
ax4.set_xlabel("Percentage")
ax4.set_yticks([],[]);
ax5 = fig.add_subplot(345)
pd.crosstab(pisa2012['ST46Q05'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title='Study Until I Understand Everything',
stacked=False, rot=0, ax=ax5, legend=False);
ax5.set_xlabel("Percentage")
ax5.set_ylabel("");
ax6 = fig.add_subplot(346)
pd.crosstab(pisa2012['ST46Q06'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Pay Attention in Classes",
stacked=False, rot=0, ax=ax6, legend=False);
ax6.set_ylabel("");
ax6.set_xlabel("Percentage")
ax6.set_yticks([],[]);
ax7 = fig.add_subplot(347)
pd.crosstab(pisa2012['ST46Q07'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title='Listen in Classes',
stacked=False, rot=0, ax=ax7, legend=False);
ax7.set_ylabel("");
ax7.set_xlabel("Percentage")
ax7.set_yticks([],[]);
ax8 = fig.add_subplot(348)
pd.crosstab(pisa2012['ST46Q08'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Avoid Distractions When Studying",
stacked=False, rot=0, ax=ax8, legend=False);
ax8.set_ylabel("");
ax8.set_xlabel("Percentage")
ax8.set_yticks([],[]);
ax9 = fig.add_subplot(349)
pd.crosstab(pisa2012['ST46Q09'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Keep Work Organized",
stacked=False, rot=0, ax=ax9, legend=False);
ax9.set_ylabel("");
ax9.set_xlabel("Percentage")
fig.suptitle(t="Summary of the parameters for Math work ethic VS Grade",
x = 0.5, y = 0.95, fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=0.85, hspace=0.4);
fig = plt.figure(figsize=(30,15))
ax1 = fig.add_subplot(241)
pd.crosstab(pisa2012['ST44Q01'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Not Good at Maths Problems",
stacked=False, rot=0,
ax=ax1, legend=False);
ax1.set_ylabel("")
ax1.set_xlabel("Percentage")
ax2 = fig.add_subplot(242)
pd.crosstab(pisa2012['ST44Q03'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Teacher Did Not Explain Well",
stacked=False,
ax=ax2, legend=False);
ax2.set_ylabel("")
ax2.set_yticks([],[])
ax2.set_xlabel("Percentage")
ax3 = fig.add_subplot(243)
pd.crosstab(pisa2012['ST44Q04'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Bad Guesses",
stacked=False, rot=0,
ax=ax3, legend=False);
ax3.set_ylabel("")
ax3.set_yticks([],[])
ax3.set_xlabel("Percentage")
ax4 = fig.add_subplot(244)
pd.crosstab(pisa2012['ST44Q05'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title="Material Too Hard",
stacked=False, rot=0, ax=ax4, legend=False);
ax4.legend(title="GRADE",bbox_to_anchor=(1.4, 0))
ax4.set_ylabel("");
ax4.set_xlabel("Percentage")
ax4.set_yticks([],[]);
ax5 = fig.add_subplot(245)
pd.crosstab(pisa2012['ST44Q07'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title='Teacher Didn’t Get Students Interested',
stacked=False, rot=0, ax=ax5, legend=False);
ax5.set_ylabel("");
ax5.set_xlabel("Percentage")
ax6 = fig.add_subplot(246)
pd.crosstab(pisa2012['ST44Q08'], pisa2012['GRADE'], normalize=True).plot(kind='barh',
title='Unlucky',
stacked=False, rot=0, ax=ax6, legend=False);
ax6.set_xlabel("Percentage")
ax6.set_ylabel("");
ax6.set_yticks([],[]);
fig.suptitle(t="Summary of the parameters for Failure in Math VS Grade",
x = 0.5, y = 0.95, fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=0.85, hspace=0.3);
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
We plotted many of the main features against region (OECD vs. non-OECD countries), sex, and grade. Grade appears to affect some of the items on failure in mathematics and on mathematics work ethic, i.e., very few students have excellent grades because of their work ethic.
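The crosstabs above use `normalize=True`, which divides every cell by the overall total, so grades with few students show small bars in every answer category. A minimal sketch of a within-grade comparison using `normalize='columns'` follows. The small frame is a made-up stand-in for `pisa2012[['ST46Q01', 'GRADE']]`, and its values (including the answer labels and grade codes) are illustrative assumptions, not PISA data.

```python
import pandas as pd

# Toy stand-in for pisa2012[['ST46Q01', 'GRADE']]; values are made up for illustration.
toy = pd.DataFrame({
    "ST46Q01": ["Agree", "Agree", "Disagree", "Agree", "Disagree", "Agree"],
    "GRADE":   [9, 9, 9, 10, 10, 10],
})

# normalize='columns' makes each grade column sum to 1, so answer shares
# are comparable across grades regardless of how many students each grade has.
within_grade = pd.crosstab(toy["ST46Q01"], toy["GRADE"], normalize="columns")
print(within_grade)
```

The same call could be passed to `.plot(kind='barh')` in place of the overall-normalized table if per-grade shares are the quantity of interest.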
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
In the study population, the two sexes are almost equally represented, whereas OECD and non-OECD countries are unbalanced.
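The balance of a categorical column can be checked with normalized value counts. The sketch below uses a made-up frame in place of `pisa2012`; the `'OECD'` and `'Non-OECD'` labels match the queries used in this notebook, while the `'Female'` and `'Male'` labels for `ST04Q01` are assumed here.

```python
import pandas as pd

# Toy stand-in for pisa2012[['ST04Q01', 'OECD']]; values are made up for illustration.
toy = pd.DataFrame({
    "ST04Q01": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "OECD":    ["OECD", "OECD", "OECD", "OECD", "Non-OECD", "Non-OECD"],
})

# Share of each category; values near 0.5 indicate a balanced two-level variable.
sex_share = toy["ST04Q01"].value_counts(normalize=True)
oecd_share = toy["OECD"].value_counts(normalize=True)
print(sex_share)
print(oecd_share)
```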
> Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
# FacetGrid for distribution of maths score according to OECD country
binsize = 20
pisa2012_oecd = pisa2012.query("OECD == 'OECD' ")
bins = np.arange(0, pisa2012_oecd['Math'].max()+binsize, binsize)
g = sb.FacetGrid(pisa2012_oecd, col='CNT', col_wrap=5,
margin_titles=True, hue='ST04Q01')
g.map(plt.hist, 'Math', bins = bins, alpha=0.5)
g.fig.subplots_adjust(top=0.95)
g.fig.suptitle('Distribution of Math Score per Gender of OECD Country');
g.add_legend();
# FacetGrid for distribution of maths score according to non-OECD country
binsize = 20
pisa2012_non_oecd = pisa2012.query("OECD == 'Non-OECD' ")
bins = np.arange(0, pisa2012_non_oecd['Math'].max()+binsize, binsize)
g = sb.FacetGrid(pisa2012_non_oecd, col='CNT', col_wrap=5,
margin_titles=True, hue='ST04Q01')
g.map(plt.hist, 'Math', bins = bins, alpha=0.5)
g.fig.subplots_adjust(top=.85)
g.fig.suptitle('Distribution of Math Score per Gender of Non-OECD Country');
g.add_legend();
fig = plt.figure(figsize=(30,15))
ax1 = fig.add_subplot(321)
ax1 = sb.barplot(x = pisa2012['HOMEPOS'], y=pisa2012['GRADE'], hue=pisa2012['OECD'], ax=ax1);
ax1.legend([]);
ax1.set_xlabel("Home Possessions");
ax2 = fig.add_subplot(322)
ax2 = sb.barplot(x = pisa2012['HOMEPOS'], y=pisa2012['ST28Q01'], hue=pisa2012['OECD'], ax=ax2);
ax2.legend([]);
ax2.set_ylabel("");
ax2.set_xlabel("Home Possessions");
ax3 = fig.add_subplot(323)
ax3 = sb.barplot(x = pisa2012['HEDRES'], y=pisa2012['GRADE'], hue=pisa2012['OECD'], ax=ax3);
ax3.legend([]);
ax3.set_xlabel("Home Educational Resources");
ax4 = fig.add_subplot(324)
ax4 = sb.barplot(x = pisa2012['HEDRES'], y=pisa2012['ST28Q01'], hue=pisa2012['OECD'], ax=ax4);
ax4.legend(bbox_to_anchor=(1.2, 1));
ax4.set_ylabel("");
ax4.set_xlabel("Home Educational Resources");
ax5 = fig.add_subplot(325)
ax5 = sb.barplot(x = pisa2012['HOMEPOS'], y=pisa2012['GRADE'], hue=pisa2012['ST04Q01'], ax=ax5);
ax5.legend([]);
ax5.set_xlabel("Home Possessions");
ax6 = fig.add_subplot(326)
ax6 = sb.barplot(x = pisa2012['HEDRES'], y=pisa2012['GRADE'], hue=pisa2012['ST04Q01'], ax=ax6);
ax6.legend(bbox_to_anchor=(1.2, 0.5));
ax6.set_xlabel("Home Educational Resources");
fig.suptitle(t="Summary of Home possessions and Home Educational resources VS Grade and Sex",
x = 0.5, y = 0.95, fontsize = 20, fontweight='bold', color = 'tab:blue');
plt.subplots_adjust(top=0.85, hspace=0.5);
### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Whether students live in an OECD or a non-OECD country plays a role in their grade.
### Were there any interesting or surprising interactions between features?
Performance in mathematics does not appear to depend on gender.
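One way to put a number on the gender comparison is a standardized mean difference (Cohen's d) of the `Math` score between the two `ST04Q01` groups. This is a sketch on made-up values, not the method used in this notebook. The `cohens_d` helper is hypothetical, and the calculation is unweighted, so it ignores any survey weights.

```python
import numpy as np
import pandas as pd

def cohens_d(a, b):
    """Standardized mean difference of a minus b, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy stand-in for pisa2012[['ST04Q01', 'Math']]; values are made up for illustration.
toy = pd.DataFrame({
    "ST04Q01": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Math":    [480.0, 500.0, 520.0, 490.0, 510.0, 530.0],
})

# Per-group summary, then the effect size of the Male minus Female difference.
print(toy.groupby("ST04Q01")["Math"].agg(["mean", "std", "count"]))
d = cohens_d(toy.loc[toy["ST04Q01"] == "Male", "Math"],
             toy.loc[toy["ST04Q01"] == "Female", "Math"])
print(d)
```

A value of d close to zero would be consistent with the overlapping histograms in the FacetGrid plots; by a common rule of thumb, values around 0.2 are considered small.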
You can write a summary of the main findings and reflect on the steps taken during the data exploration.
The family wealth of students in OECD countries contributes to their mathematics performance, in contrast to students in non-OECD countries.
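A simple way to compare this relationship across the two groups is a per-group Pearson correlation. The sketch below uses `HOMEPOS` (home possessions) as a stand-in for family wealth, which is an assumption, and the frame contains made-up values rather than PISA data.

```python
import pandas as pd

# Toy stand-in for pisa2012[['OECD', 'HOMEPOS', 'Math']]; values are made up for illustration.
toy = pd.DataFrame({
    "OECD":    ["OECD"] * 4 + ["Non-OECD"] * 4,
    "HOMEPOS": [-1.0, 0.0, 1.0, 2.0, -1.0, 0.0, 1.0, 2.0],
    "Math":    [450.0, 480.0, 510.0, 540.0, 470.0, 460.0, 480.0, 470.0],
})

# Pearson correlation between home possessions and math score within each group.
corr_by_group = {
    name: group["HOMEPOS"].corr(group["Math"])
    for name, group in toy.groupby("OECD")
}
print(corr_by_group)
```

A clearly higher correlation in one group than in the other would support the contrast described above, although a correlation alone does not establish that wealth causes the difference.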
Some of the visualization ideas in this notebook come from here